7  Data Transformation: Organizing rows & Columns

Learning Objectives

After completing this lab you should be able to

For each of our modules we will have a project-folder with an Rproject, *.qmd-files, and sub-directories for data, scripts, and results as described in our Rproject Tutorial. You should have a directory on your Desktop or Documents folder on your laptop (name it something like bi349) as a home directory for all of our project folders this semester.

You should have already downloaded the project directory for this module, make sure the directory is unzipped and move it to your bi328 directory. You can open the Rproj for this module either by double clicking on it which will launch Rstudio or by opening Rstudio and then using File > Open Project or by clicking on the Rproject icon in the top right of your program window and selecting Open Project.

Once you have opened a project you should see the project name in the top right corner1.

  • 1 Pro tip: If you run into issues where a quarto document won’t render or file paths aren’t working (especially if things were working previously) one of your first steps should be to double check that the correct Rproj is loaded.

  • There should be a document named 07_data-transformation-i.qmd in your project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing in the YAML header, change the author to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set2 you may of course add as many comments as you need to be able to recall what you did]. Similarly, take notes in the document as we discuss discussion/reflection questions but make sure that you go back and clean them up for “public consumption”.

  • 2 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.

  • Let’s start by loading the packages we will need for this activity.

    # load libraries
    library(knitr)
    library(tidyverse)

    7.1 Data Wrangling

    Now that we have a data frame to play with, let’s explore some data wrangling options using dplyr. The d stands for data and the plyr stands for plyers - this package is designed to manipulate data frames3. We are going to focus on the central actions (verbs) that will allow you to manipulate the data frame.

  • 3 This is also known as data wrangling or data munging, manipulating is not a negative thing in this case it means you can clean up and format the data in appropriate way to fit the questions you are asking and allow to to transform the information in a helpful way so that you can apply analysis and modeling as needed in the next step

  • The main advantages to using a command line program like R/code compared to a spreadsheet program such as Excel or Google sheets are:

    • You aren’t manipulating the raw data set - if you make a mistake or accidentally overwrite something you haven’t made any permanent damage.
    • You can manipulate data sets too large to easily handle in a spreadsheet
    • If you update your data set or have a second identically formatted data set you just have to re-run the code.

    Be sure to record all the steps (code chunks) in your quarto document - both the examples given here and the applications you will be asked to make. You can copy and paste, but you will find that writing out the code will help you get more used to syntax, how auto complete etc. works. Be sure to annotate/comment your code as reminders while we go through new functions in class, and that you take the time to go over your comments before submitting your knotted *.html document.

    These are central concepts that you will use and reuse throughout the semester so you will likely want to refer back to this document. A good way to create a “cheatsheet” would be to for example for each function write a short description of what it does in general before each code chunk, then make your comment in the code specific to your example. Similarly use normal text to refer to the question numbers in this manual as you work through the problem sets.

    7.2 Selecting and organizing columns

    Let’s start by loading our data set.

    # read catch data
    catch <- read_delim("data/longline_catchdata.txt", delim = "\t")
    Be mindful

    We will make heavy use of the magrittr pipe %>% this smester which allows you to link commands together; think of it as “and now do this”. R for Data Science (2e) implements the native R pipe |>. For our intents and purposes they are identical.

    Because we are mostly interested in what the individual functions (verbs) do we will not always assign a new object, but just having it print to the console/below the code chunk we will be able to immediately assess the affect. By piping our function to head() it will print just the first 6 lines.

    The function select() is used to select a subset of columns from a data set.

    For example, you can select just the Site and Species columns4.

  • 4 Remember, the function head() allows you to just print the first few lines of the dataframe to the console, otherwise you can end up with several thousand lines!

  • catch %>%
      select(Site, Species) %>%
      head()
    # A tibble: 6 × 2
      Site        Species      
      <chr>       <chr>        
    1 Aransas_Bay Bagre_marinus
    2 Aransas_Bay Bagre_marinus
    3 Aransas_Bay Bagre_marinus
    4 Aransas_Bay Bagre_marinus
    5 Aransas_Bay Bagre_marinus
    6 Aransas_Bay Bagre_marinus
    Give it a whirl

    How would you select just Day, Month, and Year columns?

    You can also specify individual columns to eliminate by name. For example, the PCL column doesn’t contain any information (all NAs).

    catch %>%
      select(-PCL) %>%
      head()
    # A tibble: 6 × 11
      Site      Species Sex   Observed_Stage    FL   STL Hook_Size   Set   Day Month
      <chr>     <chr>   <chr> <chr>          <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>
    1 Aransas_… Bagre_… U     <NA>             287   353        10     1    28     7
    2 Aransas_… Bagre_… U     <NA>             425   495        10     1    28     7
    3 Aransas_… Bagre_… U     <NA>             416   502        15     1    28     7
    4 Aransas_… Bagre_… U     <NA>             416   507        10     1    28     7
    5 Aransas_… Bagre_… U     <NA>             418   510        15     1    28     7
    6 Aransas_… Bagre_… U     <NA>             434   515        10     1    28     7
    # ℹ 1 more variable: Year <dbl>
    Give it a whirl

    How would you eliminate hook size from the data set?

    You can also eliminate multiple columns by name, for example you would remove Day, Month and Year like this:

    catch %>%
      select(-Day, -Month, -Year) %>%
      head()
    # A tibble: 6 × 9
      Site        Species     Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set
      <chr>       <chr>       <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl>
    1 Aransas_Bay Bagre_mari… U     <NA>              NA   287   353        10     1
    2 Aransas_Bay Bagre_mari… U     <NA>              NA   425   495        10     1
    3 Aransas_Bay Bagre_mari… U     <NA>              NA   416   502        15     1
    4 Aransas_Bay Bagre_mari… U     <NA>              NA   416   507        10     1
    5 Aransas_Bay Bagre_mari… U     <NA>              NA   418   510        15     1
    6 Aransas_Bay Bagre_mari… U     <NA>              NA   434   515        10     1

    If you want to re-arrange columns in your data frame, you would also use select().

    catch %>%
      select(FL, Sex, Day) %>%
      head()
    # A tibble: 6 × 3
         FL Sex     Day
      <dbl> <chr> <dbl>
    1   287 U        28
    2   425 U        28
    3   416 U        28
    4   416 U        28
    5   418 U        28
    6   434 U        28
    Protip

    If you wanted to move a set of columns to the front, but not not want to have to type in all the other column names you can use everything().

    catch %>%
      select(Day, Month, Year, everything()) %>%
      head()
    # A tibble: 6 × 12
        Day Month  Year Site        Species   Sex   Observed_Stage   PCL    FL   STL
      <dbl> <dbl> <dbl> <chr>       <chr>     <chr> <chr>          <dbl> <dbl> <dbl>
    1    28     7  2015 Aransas_Bay Bagre_ma… U     <NA>              NA   287   353
    2    28     7  2015 Aransas_Bay Bagre_ma… U     <NA>              NA   425   495
    3    28     7  2015 Aransas_Bay Bagre_ma… U     <NA>              NA   416   502
    4    28     7  2015 Aransas_Bay Bagre_ma… U     <NA>              NA   416   507
    5    28     7  2015 Aransas_Bay Bagre_ma… U     <NA>              NA   418   510
    6    28     7  2015 Aransas_Bay Bagre_ma… U     <NA>              NA   434   515
    # ℹ 2 more variables: Hook_Size <dbl>, Set <dbl>

    There you go, creating subsets of columns: Simple as that.

    7.3 Separating & uniting columns

    Occasionally you will find that you want to combine the contents of two columns into a single column (e.g. first name, last name) or at other times you may want to separate the contents of a column over multiple columns (e.g. dates).

    For example, you may have noticed that the Species is entered as genus_species - what if you wanted to have two separate columns with that information?

    The function separate() will split the contents from one column across two or more columns. To do this you need to specify the new column names (into = c("column1", "column2")), and what pattern should be used to determine where the content should be split (sep = "pattern").

    catch %>%
      separate(Species, into = c("species", "genus"), sep = "_", remove = FALSE) %>%
      head()
    # A tibble: 6 × 14
      Site    Species species genus Sex   Observed_Stage   PCL    FL   STL Hook_Size
      <chr>   <chr>   <chr>   <chr> <chr> <chr>          <dbl> <dbl> <dbl>     <dbl>
    1 Aransa… Bagre_… Bagre   mari… U     <NA>              NA   287   353        10
    2 Aransa… Bagre_… Bagre   mari… U     <NA>              NA   425   495        10
    3 Aransa… Bagre_… Bagre   mari… U     <NA>              NA   416   502        15
    4 Aransa… Bagre_… Bagre   mari… U     <NA>              NA   416   507        10
    5 Aransa… Bagre_… Bagre   mari… U     <NA>              NA   418   510        15
    6 Aransa… Bagre_… Bagre   mari… U     <NA>              NA   434   515        10
    # ℹ 4 more variables: Set <dbl>, Day <dbl>, Month <dbl>, Year <dbl>
    Give it a whirl

    In some cases, there might not be a distinct pattern that you can use to identify where to split the column content. In this case it may be more helpful to use the position (e.g. “split at”third character from the left”) Look up the separate() function in the help tab and determine how you could split the Year column so you get two new columns by splitting off the last two digits (i.e. 2021 would be 20 and 21). Then eliminate the column containing the first two digits.

    In other cases you might have information in two columns that you want to combine into a single column. This can be accomplished using the function unite().

    For example, if we wanted to create a column called date that had the day, month, and year of each sampling trip separated by an _.

    catch %>%
      unite(Date, Day, Month, Year, sep = "_", remove = FALSE) %>%
      head()
    # A tibble: 6 × 13
      Site      Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set Date 
      <chr>     <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <chr>
    1 Aransas_… Bagre_… U     <NA>              NA   287   353        10     1 28_7…
    2 Aransas_… Bagre_… U     <NA>              NA   425   495        10     1 28_7…
    3 Aransas_… Bagre_… U     <NA>              NA   416   502        15     1 28_7…
    4 Aransas_… Bagre_… U     <NA>              NA   416   507        10     1 28_7…
    5 Aransas_… Bagre_… U     <NA>              NA   418   510        15     1 28_7…
    6 Aransas_… Bagre_… U     <NA>              NA   434   515        10     1 28_7…
    # ℹ 3 more variables: Day <dbl>, Month <dbl>, Year <dbl>
    Give it a whirl

    Create a data set with the following columns in this sequence and print the first few rows to the console.

    • Set_ID (combining day, month, year, and set)
    • Genus
    • Species
    • FL
    • STL

    7.4 Sorting dataframes by a specific column content

    Until you want to visualize a table how the rows are arranged is not really important. However, for example, when generating reports you might want values to be listed in a specific way. This can be done using the function arrange().

    For example, if we wanted to sort our dataframe based on the Observed_Stage column we could do the following:

    catch %>%
      arrange(Observed_Stage)
    # A tibble: 2,325 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Redfish… Sphyrn… M     MAT              622   668   850        10     1     8
     2 Redfish… Sphyrn… M     MAT              656   710   869        10     1     8
     3 Redfish… Sphyrn… F     MAT              708   770   979        15     2     8
     4 Corpus_… Sphyrn… M     MAT              695   757   954        10     2    12
     5 Corpus_… Sphyrn… F     MAT              760   861  1090        10     2    12
     6 Corpus_… Sphyrn… M     MAT              621   689   856        10     2    27
     7 Redfish… Sphyrn… F     MAT              781   853  1020        10     4    29
     8 Redfish… Sphyrn… M     MAT              721   783   980        10     3    11
     9 Redfish… Carcha… U     UND               NA    NA    NA        15     2    16
    10 Corpus_… Sphyrn… U     UND               NA    NA    NA        10     1    27
    # ℹ 2,315 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    How would sort your table by Sex?

    By default, characters are sorted alphabetically, numeric columns from smallest to largest value. If you want to order your values from largest to smallest, you can specify that using desc()

    catch %>%
      arrange(desc(FL))
    # A tibble: 2,325 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Aransas… Carcha… F     <NA>            1042  1140  1410        15     1    25
     2 Redfish… Carcha… F     <NA>             812   900  1090        15     1    16
     3 Redfish… Carcha… M     <NA>             792   882  1092        15     2    16
     4 Corpus_… Sphyrn… F     MAT              760   861  1090        10     2    12
     5 Redfish… Sphyrn… F     MAT              781   853  1020        10     4    29
     6 Corpus_… Sciaen… U     <NA>              NA   841   950        10     3    25
     7 Redfish… Carcha… M     <NA>             740   840  1010        15     3    29
     8 Redfish… Carcha… M     <NA>             740   820  1020        10     4     1
     9 Aransas… Carcha… M     <NA>             720   812   912        15     4    22
    10 Redfish… Sphyrn… M     MAT              721   783   980        10     3    11
    # ℹ 2,315 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    How would sort your site column from Z to A?

    7.5 Filtering (subsetting) rows

    Frequently, we are less interested in being able to sort columns by content, rather, we want to extract a subset of rows based on specific content.

    The function filter() is used to subset a data frame by row based on regular expressions and the boolean operators we previously encounter to describe the content of sets of rows.

    For example, we might a data.frame with only Gafftop sail catfish (Bagre marinus)5.

  • 5 Remember for exact matches we use == not =

  • catch %>%
      filter(Species == "Bagre_marinus")
    # A tibble: 1,511 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Aransas… Bagre_… U     <NA>              NA   287   353        10     1    28
     2 Aransas… Bagre_… U     <NA>              NA   425   495        10     1    28
     3 Aransas… Bagre_… U     <NA>              NA   416   502        15     1    28
     4 Aransas… Bagre_… U     <NA>              NA   416   507        10     1    28
     5 Aransas… Bagre_… U     <NA>              NA   418   510        15     1    28
     6 Aransas… Bagre_… U     <NA>              NA   434   515        10     1    28
     7 Aransas… Bagre_… U     <NA>              NA   427   520        15     1    28
     8 Aransas… Bagre_… U     <NA>              NA   446   532        10     1    28
     9 Aransas… Bagre_… U     <NA>              NA   465   538        10     1    28
    10 Aransas… Bagre_… U     <NA>              NA   450   539        10     1    28
    # ℹ 1,501 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    How would you select only rows containing Scalloped Hammerheads (Sphyrna lewini)?”

    If we want all rows but Gafftop sailfish you can use a ! to say “not that” instead of having to list all the species that you do want to keep6.

  • 6 This is frequently called “blacklisting”, while creating a list of content that you do want to keep would be referred to as “whitelisting”.

  • catch %>%
      filter(!Species == "Bagre_marinus")
    # A tibble: 814 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Redfish… Rhizop… M     <NA>             351   378   433        15     1    29
     2 Redfish… Sphyrn… F     <NA>             470   430   600        10     3    29
     3 Redfish… Hypanu… F     <NA>              NA    NA   670        15     3    29
     4 Redfish… Hypanu… F     <NA>              NA    NA   340        10     3    29
     5 Redfish… Hypanu… M     <NA>              NA    NA   810        10     4    29
     6 Corpus_… Carcha… F     <NA>             609   670   820        15     1    30
     7 Corpus_… Sphyrn… M     <NA>             495   485   615        10     2    24
     8 Corpus_… Sphyrn… F     <NA>             550   370   720        10     2    24
     9 Corpus_… Sphyrn… M     <NA>             470   505   645        10     3    24
    10 Corpus_… Sphyrn… F     <NA>             540   565   720        10     3    24
    # ℹ 804 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    How would you create a dataframe without Scalloped Hammerheads (Sphyrna lewini) entries?

    Sometimes you might want to select rows that match one of a set of values7. In this case we would use %in% to indicate “keep any of these”.

  • 7 Recall, the function c() (concatenate) creates a vector

  • catch %>%
      filter(Species %in% c("Sciades_felis", "Bagre_marinus", "Synodus_foetens"))
    # A tibble: 2,166 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Aransas… Bagre_… U     <NA>              NA   287   353        10     1    28
     2 Aransas… Bagre_… U     <NA>              NA   425   495        10     1    28
     3 Aransas… Bagre_… U     <NA>              NA   416   502        15     1    28
     4 Aransas… Bagre_… U     <NA>              NA   416   507        10     1    28
     5 Aransas… Bagre_… U     <NA>              NA   418   510        15     1    28
     6 Aransas… Bagre_… U     <NA>              NA   434   515        10     1    28
     7 Aransas… Bagre_… U     <NA>              NA   427   520        15     1    28
     8 Aransas… Bagre_… U     <NA>              NA   446   532        10     1    28
     9 Aransas… Bagre_… U     <NA>              NA   465   538        10     1    28
    10 Aransas… Bagre_… U     <NA>              NA   450   539        10     1    28
    # ℹ 2,156 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>

    Again, if you wanted everything but rows containing those values you would preface it with a !.

    catch %>%
      filter(!Species %in% c("Sciades_felis", "Bagre_marinus", "Synodus_foetens"))
    # A tibble: 159 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Redfish… Rhizop… M     <NA>             351   378   433        15     1    29
     2 Redfish… Sphyrn… F     <NA>             470   430   600        10     3    29
     3 Redfish… Hypanu… F     <NA>              NA    NA   670        15     3    29
     4 Redfish… Hypanu… F     <NA>              NA    NA   340        10     3    29
     5 Redfish… Hypanu… M     <NA>              NA    NA   810        10     4    29
     6 Corpus_… Carcha… F     <NA>             609   670   820        15     1    30
     7 Corpus_… Sphyrn… M     <NA>             495   485   615        10     2    24
     8 Corpus_… Sphyrn… F     <NA>             550   370   720        10     2    24
     9 Corpus_… Sphyrn… M     <NA>             470   505   645        10     3    24
    10 Corpus_… Sphyrn… F     <NA>             540   565   720        10     3    24
    # ℹ 149 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    How would you subset a data frame to contain only entries for different species in the genus Carcharhinus aka the sharky-sharks? There are four species in the data set - bullsharks (Carcharhinus leucas), spinner sharks (Carcharhinus brevipinna), blacktip sharks (Carcharhinus limbatus), and smalltail sharks (Carcharhinus porosus).

    For numbers you likely aren’t just searching for exact matches, you also want to be able to set threshold values and select everything above or below. For example, you can select all rows with values greater than a certain value using >.

    catch %>%
      filter(FL > 440)
    # A tibble: 907 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Aransas… Bagre_… U     <NA>              NA   446   532        10     1    28
     2 Aransas… Bagre_… U     <NA>              NA   465   538        10     1    28
     3 Aransas… Bagre_… U     <NA>              NA   450   539        10     1    28
     4 Aransas… Bagre_… U     <NA>              NA   496   565        10     1    28
     5 Aransas… Bagre_… U     <NA>              NA   476   569        10     1    28
     6 Aransas… Bagre_… U     <NA>              NA   495   570        10     1    28
     7 Aransas… Bagre_… U     <NA>              NA   490   575        10     1    28
     8 Aransas… Bagre_… U     <NA>              NA   486   581        10     1    28
     9 Aransas… Bagre_… U     <NA>              NA   503   589        10     1    28
    10 Aransas… Bagre_… U     <NA>              NA   489   590        10     1    28
    # ℹ 897 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    Create a data frame containing only entries with a forklength shorter than 300mm, then create a data frame that contains only entries with a forklength equal to or smaller than 300mm.

    In this second piece of code you used a single boolean operator to include two conditions, “smaller than” and “equal two”. That is a special case of wanting to retain data than fulfills one of either of two conditions and we have a specific boolean operator that can combine the two.

    This is not always the case, for example, you might want to retain data that fulfills conditions in two different columns. In this case you can combine expressions using & to indicate that it must fulfill all conditions indicated or | to indicate that it must retain at least one of the.

    For example to select only scalloped hammerheads that are also smaller than 300 cm you would use

    catch %>%
      filter(Species == "Sphyrna_lewini" & FL < 300)
    # A tibble: 1 × 12
      Site      Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
      <chr>     <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
    1 Corpus_C… Sphyrn… F     <NA>             192   210   280        15     3     6
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    How would you subset a data frame to contain only entries for different species in the genus Carcharhinus that have a forklength larger than 500 cm?

    By contrast, if you wanted all entries that are either gafftops or a fork length smaller than 300 cm you could use the following code:

    catch %>%
      filter(Species == "Sphyrna_lewini" | FL < 300)
    # A tibble: 409 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Aransas… Bagre_… U     <NA>              NA   287   353        10     1    28
     2 Aransas… Bagre_… U     <NA>              NA   285   314        10     2    28
     3 Corpus_… Bagre_… U     <NA>              NA   299   348        10     1    30
     4 Corpus_… Bagre_… U     <NA>              NA   297   367        10     2    30
     5 Corpus_… Bagre_… U     <NA>              NA   298   362        10     3    30
     6 Corpus_… Bagre_… U     <NA>              NA   290   350        10     2    24
     7 Redfish… Bagre_… U     <NA>              NA   254   284        10     4     8
     8 Aransas… Bagre_… U     <NA>              NA    50   574        10     1    25
     9 Aransas… Bagre_… U     <NA>              NA   280   340         3     3    25
    10 Redfish… Bagre_… U     <NA>              NA   294   353        10     4    16
    # ℹ 399 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>
    Give it a whirl

    How would you select fish that have a forklength either smaller than 200 cm or larger than 300cm?

    7.6 Sneak peak: grouping rows for specific wrangling actions

    We have already see that it can be helpful to subset rows based on conditions that are met by the content of more than one column. In those cases, we were creating conditions based on Boolean operators.

    In many cases we might be interested in subsetting a dataframe in a way where our conditions cannot be expressed by a TRUE/FALSE scenario using Boolean operators.

    For example, we might want to extract the data entry for the longest fish in the data set based on forklength.

    The function max() can be used to get the maximum value for a vector of numbers. In this case, the vector we are looking at is the FL column of the catch dataframe.

    catch %>%
      filter(FL == max(FL, na.rm = TRUE))
    # A tibble: 1 × 12
      Site      Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
      <chr>     <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
    1 Aransas_… Carcha… F     <NA>            1042  1140  1410        15     1    25
    # ℹ 2 more variables: Month <dbl>, Year <dbl>

    That’s great, now we now what the largest fish is that we caught.

    What about if we wanted to subset the dataframe to retain the largest fish based on forklength for each species?

    Consider this

    Conceptually lay out the individual steps that you would need to complete to do this (don’t worry about whether or not you actually know how to code this).

    The tidyverse has a central concept call “split-apply-combine”, which means that occasionally we want to group entries in a dataframe (split), do some sort of manipulation (apply), but end up with a single data frame (combine). We will look at how useful this is in the next chapter but let’s take a quick sneak peak at how this is implemented in dplyr using group_by().

    catch %>%
      group_by(Species) %>%
      filter(FL == max(FL, na.rm = TRUE)) %>%
      ungroup()
    # A tibble: 12 × 12
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Corpus_… Bagre_… U     <NA>              NA   575   640        10     2    24
     2 Aransas… Rhizop… F     <NA>             580   637   790        10     2    25
     3 Aransas… Carcha… F     <NA>            1042  1140  1410        15     1    25
     4 Redfish… Carcha… F     <NA>             812   900  1090        15     1    16
     5 Corpus_… Sphyrn… F     MAT              760   861  1090        10     2    12
     6 Aransas… Carcha… F     <NA>             690   757   940        10     2    22
     7 Corpus_… Sphyrn… F     <NA>             520   578   770        10     1    21
     8 Corpus_… Carcha… U     <NA>             335   415   475        10     1     3
     9 Aransas… Sciade… U     <NA>              NA   480   548        10     2    18
    10 Aransas… Sciade… U     <NA>              NA   480   580        10     2    18
    11 Corpus_… Sciaen… U     <NA>              NA   841   950        10     3    25
    12 Corpus_… Synodu… U     <NA>              NA   173   185        10     3    30
    # ℹ 2 more variables: Month <dbl>, Year <dbl>

    This is also an example of how we can use the pipe (%>%) to string a bunch of commands, in this example we are saying “take the object catch, and then group rows by Species and then for each group retain only the maximum forklength value for that group and then ungroup them again.”

    Give it a whirl

    How would you group rows by Species and then retain the individual caught on the largest hook size for each species?

    Protip

    Specifically for cases where we want to retain the largest or smallest values, we can use family of of functions called slice() which allow us to subset rows based on their position.

    For example, we can retain the largest 5 individuals per species based on forklength using slice_max()

    catch %>%
      group_by(Species) %>%
      slice_max(order_by = FL, n = 5)
    # A tibble: 66 × 12
    # Groups:   Species [14]
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Corpus_… Bagre_… U     <NA>              NA   575   640        10     2    24
     2 Corpus_… Bagre_… U     <NA>              NA   574   676        10     2    21
     3 Aransas… Bagre_… U     <NA>              NA   565    NA        10     1    17
     4 Redfish… Bagre_… U     <NA>              NA   564   651        15     1    29
     5 Aransas… Bagre_… U     <NA>              NA   555   541        10     3    13
     6 Aransas… Bagre_… U     <NA>              NA   555    NA        15     1    17
     7 Redfish… Carcha… F     <NA>             812   900  1090        15     1    16
     8 Redfish… Carcha… M     <NA>             792   882  1092        15     2    16
     9 Redfish… Carcha… M     <NA>             740   820  1020        10     4     1
    10 Redfish… Carcha… M     <NA>             660   722   880        10     1    20
    # ℹ 56 more rows
    # ℹ 2 more variables: Month <dbl>, Year <dbl>

    7.7 Create a subset with only unique entries

    Occasionally, you might want to create a subset of the data set that shows only the unique (distinct) entries for a specific column; this is especially common during an exploratory analysis of a data set that you are getting an overview of. This can be achieved using the function distinct().

    For example, we might want to know which years the survey took place.

    catch %>%
      distinct(Year)
    # A tibble: 4 × 1
       Year
      <dbl>
    1  2015
    2  2016
    3  2017
    4  2018

    Notice how that dropped all the other columns. You can switch that off using .keep_all = FALSE.

    Give it a whirl

    How would you produce a table with only one representative per species?

    You can also combine columns. For example if we wanted to determine the individual sets of the data set we could use

    catch %>%
      distinct(Day, Month, Year, Set)
    # A tibble: 197 × 4
         Day Month  Year   Set
       <dbl> <dbl> <dbl> <dbl>
     1    28     7  2015     1
     2    28     7  2015     2
     3    28     7  2015     3
     4    28     7  2015     4
     5    29     7  2015     1
     6    29     7  2015     2
     7    29     7  2015     4
     8    30     7  2015     1
     9    30     7  2015     2
    10    30     7  2015     3
    # ℹ 187 more rows
    Give it a whirl

    How would you produce a table showing all the species caught per station using distinct(), presented as arranged alphabetically by Site and Species within Site?